NSF PAR Search | NSF Public Access Repository

Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models

Chiang, Hung-Yueh; Chang, Chi-Chih; Frumkin, Natalia; Wu, Kai-Chiang; Abdelfattah, Mohamed S; Marculescu, Diana (July 2025, ICML 2025 https://icml.cc/virtual/2025/poster/44833)

Free, publicly-accessible full text available July 15, 2026

Quamba: A Post-Training Quantization Recipe for Selective State Space Models

Chiang, Hung-Yueh; Chang, Chi-Chih; Frumkin, Natalia; Wu, Kai-Chiang; Marculescu, Diana (April 2025, ICLR 2025 https://iclr.cc/virtual/2025/poster/28449)

Free, publicly-accessible full text available April 24, 2026

Efficient Low-rank Backpropagation for Vision Transformer Adaptation

Yang, Yuedong; Chiang, Hung-Yueh; Li, Guihong; Marculescu, Diana; Marculescu, Radu (December 2023, ACM)

The increasing scale of vision transformers (ViTs) has made efficient fine-tuning of these large models for specific needs a significant challenge in various applications. This issue stems from the computationally demanding matrix multiplications required during backpropagation through the linear layers of ViTs. In this paper, we tackle this problem by proposing a new Low-rank Back-Propagation via Walsh-Hadamard Transformation (LBP-WHT) method. Intuitively, LBP-WHT projects the gradient into a low-rank space and carries out backpropagation there. This approach substantially reduces the computation needed for adapting ViTs, since matrix multiplication in the low-rank space is far less resource-intensive. We conduct extensive experiments with different models (ViT and hybrid convolution-ViT models) on multiple datasets to demonstrate the effectiveness of our method. For instance, when adapting an EfficientFormer-L1 model on CIFAR100, our LBP-WHT achieves 10.4% higher accuracy than the state-of-the-art baseline while requiring 9 MFLOPs less computation. As the first work to accelerate ViT adaptation with low-rank backpropagation, our LBP-WHT method is complementary to many prior efforts and can be combined with them for better performance.
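The core idea of projecting the gradient into a low-rank space before the expensive matrix multiplication can be illustrated for a single linear layer y = x Wᵀ, whose weight gradient is grad_yᵀ x. This is a minimal NumPy sketch, not the authors' implementation: it assumes a 1-D Walsh-Hadamard basis over the token axis (the paper works on ViT patch grids and its basis selection may differ), uses dense matrix products instead of a fast WHT, and the helper names are hypothetical.

```python
import numpy as np


def hadamard(n: int) -> np.ndarray:
    """Sylvester-construction Hadamard matrix of size n (n must be a power of 2)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of 2")
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h


def wht_lowrank_basis(n_tokens: int, rank: int) -> np.ndarray:
    """(rank, n_tokens) orthonormal Walsh-Hadamard rows, lowest sequency first."""
    h = hadamard(n_tokens) / np.sqrt(n_tokens)  # orthonormal rows: h @ h.T == I
    # Sequency = number of sign changes along a row (a frequency-like ordering).
    sequency = np.sum(h[:, 1:] * h[:, :-1] < 0, axis=1)
    order = np.argsort(sequency, kind="stable")
    return h[order[:rank]]


def weight_grad_lowrank(x: np.ndarray, grad_y: np.ndarray, rank: int) -> np.ndarray:
    """Approximate dL/dW = grad_y.T @ x for a linear layer y = x @ W.T.

    x: (N, d_in) layer input, grad_y: (N, d_out) output gradient, N a power of 2.
    Both are projected onto `rank` WHT bases along the token axis, so the final
    product costs O(rank * d_in * d_out) instead of O(N * d_in * d_out).
    """
    p = wht_lowrank_basis(x.shape[0], rank)  # (rank, N)
    x_low = p @ x                            # (rank, d_in)
    g_low = p @ grad_y                       # (rank, d_out)
    return g_low.T @ x_low                   # (d_out, d_in)
```

With `rank` equal to the number of tokens the projection is orthogonal and the result equals `grad_y.T @ x` exactly; smaller ranks keep only the low-sequency components of the gradient, trading accuracy for computation.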
Full Text Available

MobileTL: On-Device Transfer Learning with Inverted Residual Blocks

https://doi.org/10.1609/aaai.v37i6.25874

Chiang, Hung-Yueh; Frumkin, Natalia; Liang, Feng; Marculescu, Diana (June 2023, Proceedings of the AAAI Conference on Artificial Intelligence)

Transfer learning on edge devices is challenging due to limited on-device resources. Existing work addresses this issue by training a subset of parameters or adding model patches. Developed with inference in mind, Inverted Residual Blocks (IRBs) split a convolutional layer into depthwise and pointwise convolutions, leading to more stacked layers, e.g., convolution, normalization, and activation layers. Though they are efficient for inference, IRBs require additional activation maps to be stored in memory for training the weights of convolution layers and the scales of normalization layers. As a result, their high memory cost prohibits training IRBs on resource-limited edge devices, making them unsuitable in the context of transfer learning. To address this issue, we present MobileTL, a memory- and computation-efficient on-device transfer learning method for models built with IRBs. MobileTL trains the shifts of internal normalization layers to avoid storing activation maps for the backward pass. Also, MobileTL approximates the backward computation of the activation layer (e.g., Hard-Swish and ReLU6) as a signed function, which enables storing a binary mask instead of activation maps for the backward pass. MobileTL fine-tunes a few top blocks (close to the output) rather than propagating the gradient through the whole network, reducing the computation cost. Our method reduces memory usage by 46% and 53% for MobileNetV2 and V3 IRBs, respectively. For MobileNetV3, we observe a 36% reduction in floating-point operations (FLOPs) when fine-tuning 5 blocks, while incurring only a 0.6% accuracy reduction on CIFAR10. Extensive experiments on multiple datasets demonstrate that our method is Pareto-optimal (best accuracy under given hardware constraints) compared to prior work in transfer learning for edge devices.
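The two memory-saving ideas in the abstract, training only normalization shifts and keeping a binary mask instead of activation maps, can be illustrated with a minimal NumPy sketch. The function names, the NCHW layout, and the step at `threshold=0.0` used to approximate the Hard-Swish derivative are assumptions for illustration, not MobileTL's published implementation.

```python
import numpy as np


def norm_shift_grad(grad_y: np.ndarray) -> np.ndarray:
    """Gradient of the per-channel shift (beta) of y = gamma * x_hat + beta, NCHW layout.

    dL/dbeta only sums the incoming gradient, so nothing from the forward pass
    has to be stored; dL/dgamma = sum(grad_y * x_hat) would need x_hat in memory.
    """
    return grad_y.sum(axis=(0, 2, 3))


def hard_swish(x: np.ndarray) -> np.ndarray:
    return x * np.clip(x + 3.0, 0.0, 6.0) / 6.0


def hard_swish_forward(x: np.ndarray, threshold: float = 0.0):
    """Hard-Swish forward pass that keeps only a packed 1-bit mask for backward.

    HYPOTHETICAL approximation: d hard_swish/dx ~= 1 where x > threshold, else 0.
    """
    y = hard_swish(x)
    packed_mask = np.packbits(x > threshold)  # flattened, 8 mask bits per byte
    return y, packed_mask


def hard_swish_backward(grad_y: np.ndarray, packed_mask: np.ndarray) -> np.ndarray:
    """Approximate input gradient using only the stored binary mask."""
    mask = np.unpackbits(packed_mask, count=grad_y.size).reshape(grad_y.shape)
    return grad_y * mask
```

The packed mask costs 1 bit per element instead of 32 bits for a float32 activation map. For ReLU6, a mask of `(x > 0) & (x < 6)` is already the exact derivative away from its two kink points, so the approximation only matters for smooth activations such as Hard-Swish.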
Full Text Available